{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Text Classification can be used to solve various use-cases like sentiment analysis, spam detection, hashtag prediction etc. This notebook demonstrates the use of SageMaker BlazingText to perform supervised binary/multi class with single or multi label text classification. BlazingText can train the model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Let's start by specifying:\n",
    "\n",
    "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region. \n",
    "- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from sagemaker python SDK."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "isConfigCell": true
   },
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "import json\n",
    "import boto3\n",
    "\n",
    "sess = sagemaker.Session()\n",
    "\n",
    "role = get_execution_role()\n",
    "print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf\n",
    "\n",
    "bucket = sess.default_bucket() # Replace with your own bucket name if needed\n",
    "print(bucket)\n",
    "prefix = 'blazingtext/supervised' #Replace with the prefix under which you want to store the data if needed"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Preparation\n",
    "\n",
    "Now we'll download a dataset from the web on which we want to train the text classification model. BlazingText expects a single preprocessed text file with space separated tokens and each line of the file should contain a single sentence and the corresponding label(s) prefixed by \"\\__label\\__\".\n",
    "\n",
    "In this example, let us train the text classification model on the [DBPedia Ontology Dataset](https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014#2) as done by [Zhang et al](https://arxiv.org/pdf/1509.01626.pdf). The DBpedia ontology dataset is constructed by picking 14 nonoverlapping classes from DBpedia 2014. It has 560,000 training samples and 70,000 testing samples. The fields we used for this dataset contain title and abstract of each Wikipedia article. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "!wget https://github.com/saurabh3949/Text-Classification-Datasets/raw/master/dbpedia_csv.tar.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!tar -xzvf dbpedia_csv.tar.gz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us inspect the dataset and the classes to get some understanding about how the data and the label is provided in the dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head dbpedia_csv/train.csv -n 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen from the above output, the CSV has 3 fields - Label index, title and abstract. Let us first create a label index to label name mapping and then proceed to preprocess the dataset for ingestion by BlazingText."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we will print the labels file (`classes.txt`) to see all possible labels followed by creating an index to label mapping."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat dbpedia_csv/classes.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code creates the mapping from integer indices to class label which will later be used to retrieve the actual class name during inference. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index_to_label = {} \n",
    "with open(\"dbpedia_csv/classes.txt\") as f:\n",
    "    for i,label in enumerate(f.readlines()):\n",
    "        index_to_label[str(i+1)] = label.strip()\n",
    "print(index_to_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preprocessing\n",
    "We need to preprocess the training data into **space separated tokenized text** format which can be consumed by `BlazingText` algorithm. Also, as mentioned previously, the class label(s) should be prefixed with `__label__` and it should be present in the same line along with the original sentence. We'll use `nltk` library to tokenize the input sentences from DBPedia dataset. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Download the nltk tokenizer and other libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from random import shuffle\n",
    "import multiprocessing\n",
    "from multiprocessing import Pool\n",
    "import csv\n",
    "import nltk\n",
    "nltk.download('punkt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def transform_instance(row):\n",
    "    cur_row = []\n",
    "    label = \"__label__\" + index_to_label[row[0]]  #Prefix the index-ed label with __label__\n",
    "    cur_row.append(label)\n",
    "    cur_row.extend(nltk.word_tokenize(row[1].lower()))\n",
    "    cur_row.extend(nltk.word_tokenize(row[2].lower()))\n",
    "    return cur_row"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `transform_instance` will be applied to each data instance in parallel using python's multiprocessing module"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def preprocess(input_file, output_file, keep=1):\n",
    "    all_rows = []\n",
    "    with open(input_file, 'r') as csvinfile:\n",
    "        csv_reader = csv.reader(csvinfile, delimiter=',')\n",
    "        for row in csv_reader:\n",
    "            all_rows.append(row)\n",
    "    shuffle(all_rows)\n",
    "    all_rows = all_rows[:int(keep*len(all_rows))]\n",
    "    pool = Pool(processes=multiprocessing.cpu_count())\n",
    "    transformed_rows = pool.map(transform_instance, all_rows)\n",
    "    pool.close() \n",
    "    pool.join()\n",
    "    \n",
    "    with open(output_file, 'w') as csvoutfile:\n",
    "        csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\\n')\n",
    "        csv_writer.writerows(transformed_rows)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "# Preparing the training dataset\n",
    "\n",
    "# Since preprocessing the whole dataset might take a couple of mintutes,\n",
    "# we keep 20% of the training dataset for this demo.\n",
    "# Set keep to 1 if you want to use the complete dataset\n",
    "preprocess('dbpedia_csv/train.csv', 'dbpedia.train', keep=.2)\n",
    "        \n",
    "# Preparing the validation dataset        \n",
    "preprocess('dbpedia_csv/test.csv', 'dbpedia.validation')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data preprocessing cell might take a minute to run. After the data preprocessing is complete, we need to upload it to S3 so that it can be consumed by SageMaker to execute training jobs. We'll use Python SDK to upload these two files to the bucket and prefix location that we have set above.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "train_channel = prefix + '/train'\n",
    "validation_channel = prefix + '/validation'\n",
    "\n",
    "sess.upload_data(path='dbpedia.train', bucket=bucket, key_prefix=train_channel)\n",
    "sess.upload_data(path='dbpedia.validation', bucket=bucket, key_prefix=validation_channel)\n",
    "\n",
    "s3_train_data = 's3://{}/{}'.format(bucket, train_channel)\n",
    "s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we need to setup an output location at S3, where the model artifact will be dumped. These artifacts are also the output of the algorithm's traning job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training\n",
    "Now that we are done with all the setup that is needed, we are ready to train our object detector. To begin, let us create a ``sageMaker.estimator.Estimator`` object. This estimator will launch the training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "region_name = boto3.Session().region_name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, \"blazingtext\", \"latest\")\n",
    "print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the BlazingText model for supervised text classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to the original implementation of [Word2Vec](https://arxiv.org/pdf/1301.3781.pdf), SageMaker BlazingText provides an efficient implementation of the continuous bag-of-words (CBOW) and skip-gram architectures using Negative Sampling, on CPUs and additionally on GPU[s]. The GPU implementation uses highly optimized CUDA kernels. To learn more, please refer to [*BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs*](https://dl.acm.org/citation.cfm?doid=3146347.3146354).\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides skip-gram and CBOW, SageMaker BlazingText also supports the \"Batch Skipgram\" mode, which uses efficient mini-batching and matrix-matrix operations ([BLAS Level 3 routines](https://software.intel.com/en-us/mkl-developer-reference-fortran-blas-level-3-routines)). This mode enables distributed word2vec training across multiple CPU nodes, allowing almost linear scale up of word2vec computation to process hundreds of millions of words per second. Please refer to [*Parallelizing Word2Vec in Shared and Distributed Memory*](https://arxiv.org/pdf/1604.04661.pdf) to learn more."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "BlazingText also supports a *supervised* mode for text classification. It extends the FastText text classifier to leverage GPU acceleration using custom CUDA kernels. The model can be trained on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with the state-of-the-art deep learning text classification algorithms. For more information, please refer to the [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To summarize, the following modes are supported by BlazingText on different types instances:\n",
    "\n",
    "|          Modes         \t| cbow (supports subwords training) \t| skipgram (supports subwords training) \t| batch_skipgram \t| supervised |\n",
    "|:----------------------:\t|:----:\t|:--------:\t|:--------------:\t| :--------------:\t|\n",
    "|   Single CPU instance  \t|   ✔  \t|     ✔    \t|        ✔       \t|  ✔  |\n",
    "|   Single GPU instance  \t|   ✔  \t|     ✔    \t|                \t|  ✔ (Instance with 1 GPU only)  |\n",
    "| Multiple CPU instances \t|      \t|          \t|        ✔       \t|     | |\n",
    "\n",
    "Now, let's define the SageMaker `Estimator` with resource configurations and hyperparameters to train Text Classification on *DBPedia* dataset, using \"supervised\" mode on a `c4.4xlarge` instance.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bt_model = sagemaker.estimator.Estimator(container,\n",
    "                                         role, \n",
    "                                         train_instance_count=1, \n",
    "                                         train_instance_type='ml.c4.4xlarge',\n",
    "                                         train_volume_size = 30,\n",
    "                                         train_max_run = 360000,\n",
    "                                         input_mode= 'File',\n",
    "                                         output_path=s3_output_location,\n",
    "                                         sagemaker_session=sess)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bt_model.set_hyperparameters(mode=\"supervised\",\n",
    "                            epochs=10,\n",
    "                            min_count=2,\n",
    "                            learning_rate=0.05,\n",
    "                            vector_dim=10,\n",
    "                            early_stopping=True,\n",
    "                            patience=4,\n",
    "                            min_epochs=5,\n",
    "                            word_ngrams=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that the hyper-parameters are setup, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create the `sagemaker.session.s3_input` objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', \n",
    "                        content_type='text/plain', s3_data_type='S3Prefix')\n",
    "validation_data = sagemaker.session.s3_input(s3_validation_data, distribution='FullyReplicated', \n",
    "                             content_type='text/plain', s3_data_type='S3Prefix')\n",
    "data_channels = {'train': train_data, 'validation': validation_data}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have our `Estimator` object, we have set the hyper-parameters for this object and we have our data channels linked with the algorithm. The only  remaining thing to do is to train the algorithm. The following command will train the algorithm. Training the algorithm involves a few steps. Firstly, the instance that we requested while creating the `Estimator` classes is provisioned and is setup with the appropriate libraries. Then, the data from our channels are downloaded into the instance. Once this is done, the training job begins. The provisioning and data downloading will take some time, depending on the size of the data. Therefore it might be a few minutes before we start getting training logs for our training jobs. The data logs will also print out Accuracy on the validation data for every epoch after training job has executed `min_epochs`. This metric is a proxy for the quality of the algorithm. \n",
    "\n",
    "Once the job has finished a \"Job complete\" message will be printed. The trained model can be found in the S3 bucket that was setup as `output_path` in the estimator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bt_model.fit(inputs=data_channels, logs=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hosting / Inference\n",
    "Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don't have to host on the same type of instance that we used to train. Because instance endpoints will be up and running for long, it's advisable to choose a cheaper instance for inference."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_classifier = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Use JSON format for inference\n",
    "BlazingText supports `application/json` as the content-type for inference. The payload should contain a list of sentences with the key as \"**instances**\" while being passed to the endpoint."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences = [\"Convair was an american aircraft manufacturing company which later expanded into rockets and spacecraft.\",\n",
    "            \"Berwick secondary college is situated in the outer melbourne metropolitan suburb of berwick .\"]\n",
    "\n",
    "# using the same nltk tokenizer that we used during data preparation for training\n",
    "tokenized_sentences = [' '.join(nltk.word_tokenize(sent)) for sent in sentences]\n",
    "\n",
    "payload = {\"instances\" : tokenized_sentences}\n",
    "\n",
    "response = text_classifier.predict(json.dumps(payload))\n",
    "\n",
    "predictions = json.loads(response)\n",
    "print(json.dumps(predictions, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the model will return only one prediction, the one with the highest probability. For retrieving the top k predictions, you can set `k` in the configuration as shown below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "payload = {\"instances\" : tokenized_sentences,\n",
    "          \"configuration\": {\"k\": 2}}\n",
    "\n",
    "response = text_classifier.predict(json.dumps(payload))\n",
    "\n",
    "predictions = json.loads(response)\n",
    "print(json.dumps(predictions, indent=2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stop / Close the Endpoint (Optional)\n",
    "Finally, we should delete the endpoint before we close the notebook if we don't need to keep the endpoint running for serving realtime predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sess.delete_endpoint(text_classifier.endpoint)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_python3",
   "language": "python",
   "name": "conda_python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  },
  "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
